
[DCP Ingestion] Automate the DB initialization, Dockerize Ingestion Helper, and fix lock acquisition issue #1971

Open

gmechali wants to merge 7 commits into datacommonsorg:master from gmechali:dockerize

Conversation

@gmechali

This PR does 3 main things:

  • It dockerizes the ingestion helper so that we can run it as a Cloud Run server. Today for Base DC we deploy it as a Cloud Run function, but that isn't ideal for DCP. For DCP, we will deploy a Cloud Run server on the datacommon-ingestion-helper image. I have added a Cloud Build trigger in datcom-ci to rebuild the image every time a change lands on the master branch in the ingestion-helper directory.
  • Adds the Initialize Database action to the IngestionHelper function, which initializes all the tables in spanner_schema.sql (Edge, Node, Observation, ImportStatus, IngestionLock, ImportVersionHistory, IngestionHistory).
  • Modifies the Lock Acquisition step to insert the lock row the first time it runs.
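
The initialization step above has to turn spanner_schema.sql into the list of individual DDL statements that google-cloud-spanner's `update_ddl` expects. A minimal sketch of that splitting step, assuming a plain schema file with `--` line comments (the helper name and comment handling are illustrative, not the PR's actual code):

```python
def split_ddl(schema_sql: str) -> list[str]:
    """Split a schema file into individual DDL statements.

    Spanner's Database.update_ddl takes a list of statements, so the
    schema file contents are split on ';' with SQL line comments and
    empty fragments removed.
    """
    statements = []
    for raw in schema_sql.split(";"):
        # Drop "--" line comments, then trim surrounding whitespace.
        lines = [l for l in raw.splitlines() if not l.strip().startswith("--")]
        stmt = "\n".join(lines).strip()
        if stmt:
            statements.append(stmt)
    return statements
```

The resulting list could then be passed to `database.update_ddl(statements)` (a real google-cloud-spanner call) and the returned long-running operation awaited before ingestion proceeds.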

As a follow-up, we should:

  • Migrate the Base DC Cloud Run function to use this dockerized image and deploy it as a Cloud Run server.
  • Remove all the logic from the ingestion pipeline that initializes the database.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a database initialization feature for the ingestion helper, including Docker and Cloud Build configurations, a Spanner schema, and logic to apply DDL statements. The locking mechanism was also updated to handle creation of new lock rows. The feedback includes a suggestion for better DatabaseAdminClient configuration.

Comment thread import-automation/workflow/ingestion-helper/spanner_client.py
Comment thread import-automation/workflow/ingestion-helper/schema.sql
@gmechali gmechali requested review from dwnoble, keyurva and vish-cs April 20, 2026 17:06
Comment thread import-automation/workflow/ingestion-helper/main.py
Comment thread import-automation/workflow/ingestion-helper/schema.sql
@gmechali gmechali requested a review from vish-cs April 21, 2026 12:36

@dwnoble left a comment


Thanks Gabe!

I think this tooling may be better suited to live in the DCP repo at https://github.com/datacommonsorg/datacommons

This PR converts the import automation cloud functions to a REST service, but if we're going to do that, I think it makes sense to consolidate it with the DCP FastAPI service so we don't end up adding a separate process.

Then this action could look something like: POST https://<dcp-service>/api/admin/initialize-database

Regardless, I think we should also:

  • convert this toolset from requirements.txt files to uv so we can have consistent builds with the uv.lock files
  • migrate the functions in main.py, which are currently tied to Cloud Run functions, to a plain Python module, with the Cloud Run part as a minimal wrapper on top
  • expose these functions via a command-line tool. If we migrate to the DCP repo, this might look like: $ datacommons admin initialize-database or $ datacommons admin release-ingestion-lock
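
The command-line suggestion could be sketched with stdlib argparse subcommands. Only the command and subcommand names come from the comment above; the parser layout and handler wiring are hypothetical:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI layout matching the suggested commands:
    #   datacommons admin initialize-database
    #   datacommons admin release-ingestion-lock
    parser = argparse.ArgumentParser(prog="datacommons")
    groups = parser.add_subparsers(dest="group", required=True)

    admin = groups.add_parser("admin", help="administrative actions")
    actions = admin.add_subparsers(dest="action", required=True)
    actions.add_parser("initialize-database",
                       help="apply the Spanner schema DDL")
    actions.add_parser("release-ingestion-lock",
                       help="clear a stuck IngestionLock row")
    return parser
```

Each action parser would then dispatch to the same plain Python module the Cloud Run wrapper calls, keeping one implementation behind both entry points.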

cc @vish-cs

Comment thread import-automation/workflow/ingestion-helper/spanner_client.py
RUN protoc --include_imports --descriptor_set_out=storage.pb storage.proto

# Run the functions framework
CMD ["functions-framework", "--target", "ingestion_helper"]


I don't think the GCP functions-framework utility is the right tool to use here for hosting a REST service.

Is the plan to migrate all of these functions to cloud run services?

If so, I think it makes sense to put them in the DCP service as a new endpoint:

https://github.com/datacommonsorg/datacommons/tree/main/packages/datacommons-api/datacommons_api/

This way we won't need a new image or CI/CD pipeline for building the import automation Docker image.

Also, moving from functions-framework to FastAPI gives us better logging, request & response validation with Pydantic objects, and puts these APIs in a single place.

Author


I agree that functions-framework is a bit limited and odd in this case, but this is essentially the minimum work to migrate the existing Cloud Run function to a Cloud Run service that executes the same function. It allows dockerization for DCP without bloating Base DC infra.

I think there needs to be much more discussion before moving this to the DCP API. It certainly has some pros, but also some major drawbacks, such as moving ingestion code used across Base DC into the new DCP repo. It might be premature to move to the DCP service, especially as the DCP service is out of scope for M1.

If we leave it like this, M1 ships 4 artifacts:

  • MCP PyPI package
  • Mixer image
  • ingestion helper image
  • dataflow template

all connected through the Terraform deployment. If we move the ingestion helper to the DCP service, then we just need to maintain the DCP service image. Agreed that functions-framework is limited and not ideal; I think that migration can happen at any point in the future. There is no need to do it now, and there would be no complexity in migrating it into the DCP service when we launch the DCP service (in Q3/Q4). Especially as the ingestion helper is required for Base DC, I don't think it makes sense for us to take this on now, since we would have to deploy the DCP service to Base DC for ingestion helper support.

Happy to hop on a call to discuss further!

Comment thread import-automation/workflow/ingestion-helper/README.md
@gmechali
Author

> Thanks Gabe!
>
> I think this tooling may be better suited to live in the DCP repo at https://github.com/datacommonsorg/datacommons
>
> This PR converts the import automation cloud functions to a REST service, but if we're going to do that, I think it makes sense to consolidate it with the DCP FastAPI service so we don't end up adding a separate process.
>
> Then this action could look something like: POST https://<dcp-service>/api/admin/initialize-database
>
> Regardless, I think we should also:
>
>   • convert this toolset from requirements.txt files to uv so we can have consistent builds with the uv.lock files
>   • migrate the functions in main.py, which are currently tied to Cloud Run functions, to a plain Python module, with the Cloud Run part as a minimal wrapper on top
>   • expose these functions via a command-line tool. If we migrate to the DCP repo, this might look like: $ datacommons admin initialize-database or $ datacommons admin release-ingestion-lock
>
> cc @vish-cs

On moving to the DCP repo, see my reply in the other comment thread: maybe one day, but it's premature, and splitting up the ingestion code also seems controversial.
Converting to uv: done!
Migrating from Cloud Run functions to a plain Python module: agreed, but I don't think there's an immediate need for it. Can we track this post-M1?
Command-line tool: again, I think this is well past M1, right? All nice-to-haves, but are we aligned that these are future improvements, not for the MVP?

@gmechali gmechali requested a review from dwnoble April 22, 2026 17:22